Feat/backend by Sam-24-dev · Pull Request #4 · Sam-24-dev/Technology-trend-analysis-platform

Sam-24-dev · 2026-02-23T02:16:45Z

This pull request significantly restructures and enhances the weekly ETL pipeline workflow, improves environment configuration options, and updates CI/CD triggers and dependency audit handling. The main focus is on modularizing ETL jobs by data source, improving artifact management, and adding robust validation and publishing steps. Additionally, several environment variables and workflow triggers have been updated for better flexibility and reliability.

ETL Pipeline Refactor and Enhancement:

The .github/workflows/etl_semanal.yml workflow is fully modularized: each data source (GitHub, StackOverflow, Reddit) now runs in its own job, with artifacts uploaded and aggregated in a dedicated aggregation job. This improves parallelism, error isolation, and maintainability. [1] [2] [3]
The aggregation job performs artifact handoff, validates required and optional outputs, runs quality gates, and uploads aggregate artifacts for downstream publishing.
The publish job restores and commits only changed data, with clear English summaries and improved commit messages.
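The required/optional validation in the aggregation job can be sketched in Python. This is a minimal illustration, not the workflow's actual step: the function name `check_artifacts` and the file names in the usage note are hypothetical, since the PR description does not list the real artifact paths.

```python
from pathlib import Path


def check_artifacts(root, required, optional):
    """Return (errors, warnings) for a directory of downloaded artifacts.

    A required file that is missing or empty is an error; a missing
    optional file only produces a warning. All names are hypothetical.
    """
    root = Path(root)
    errors, warnings = [], []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            errors.append(f"required artifact missing or empty: {name}")
    for name in optional:
        if not (root / name).is_file():
            warnings.append(f"optional artifact missing: {name}")
    return errors, warnings
```

A caller would typically fail the job when `errors` is non-empty and only log `warnings`, for example `check_artifacts("artifacts", ["github.csv"], ["reddit.csv"])` with placeholder file names.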

Environment and Configuration Improvements:

New environment variables for controlling data write strategies and trend score engine selection are added to .env.example and set in the ETL workflow, enabling more flexible and transparent configuration. [1] [2]

CI/CD Workflow Updates:

CI and dependency security workflows now trigger on relevant branches (main, feat/backend, feat/frontend), ensuring checks are run for active development streams. [1] [2]
The dependency audit step in .github/workflows/dependency_security.yml now ignores a known NLTK vulnerability (CVE-2025-14009) until a fix is available, preventing unnecessary pipeline failures.

Deployment Workflow Safeguard:

The frontend deployment workflow now only runs for successful workflow runs on the main branch, reducing the risk of unintended deployments.

Most important changes:

ETL Pipeline Modularization and Validation

Refactored .github/workflows/etl_semanal.yml to run ETL jobs for GitHub, StackOverflow, and Reddit as separate jobs, each uploading its own artifacts, followed by an aggregation job that validates outputs, runs quality gates, and uploads aggregate artifacts for publishing. [1] [2] [3]
Added robust artifact handoff and validation steps to ensure all required and optional data files are present before proceeding, with clear error and warning reporting.

Configuration and Environment

Introduced new environment variables in .env.example and set them in the ETL workflow for data write strategies (DATA_WRITE_LEGACY_CSV, etc.) and the trend score engine selector (TREND_SCORE_ENGINE), allowing for granular control of ETL outputs. [1] [2]
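As a rough sketch of how such flags might be read on the Python side: only the variable names `DATA_WRITE_LEGACY_CSV` and `TREND_SCORE_ENGINE` come from the PR. The helper names, the accepted truthy strings, and the engine values `"legacy"` and `"duckdb"` are assumptions for illustration and may differ from `backend/config/settings.py`.

```python
import os

# Assumed set of strings treated as "enabled"; the real parser may differ.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag such as DATA_WRITE_LEGACY_CSV from the environment."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in _TRUE_VALUES


def select_engine(environ=None, allowed=("legacy", "duckdb"), default="legacy"):
    """Resolve TREND_SCORE_ENGINE; the allowed values here are hypothetical."""
    env = os.environ if environ is None else environ
    value = env.get("TREND_SCORE_ENGINE", default).strip().lower()
    if value not in allowed:
        raise ValueError(f"unsupported TREND_SCORE_ENGINE: {value!r}")
    return value
```

Rejecting unknown engine names up front, rather than silently falling back, keeps a typo in the workflow file from quietly selecting the wrong implementation.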

CI/CD and Audit Workflow Improvements

Updated CI and dependency audit workflows to trigger on feat/backend and feat/frontend branches, ensuring active feature branches are tested and audited. [1] [2]
The dependency audit step now ignores CVE-2025-14009 for NLTK, preventing unnecessary failures until a fix is available.

Deployment Workflow Safeguard

The frontend deployment workflow is now restricted to only trigger on successful runs from the main branch, reducing accidental deployments from other branches.

User-facing Improvements

All ETL and publish job summaries, commit messages, and error outputs are now in clear English, improving clarity for contributors and reviewers.

Copilot

Pull request overview

This pull request implements a comprehensive V2 backend refactoring for the Technology Trend Analysis Platform, transitioning from a monolithic CSV-only pipeline to a modular, serverless data stack with enhanced quality controls, dual-write capabilities, and frontend bridge support. The changes span 59 files with significant architectural improvements while maintaining backward compatibility.

Changes:

Modularized ETL pipeline with parallel GitHub Actions jobs (GitHub, StackOverflow, Reddit) using artifact-based handoff and aggregation
Implemented dual-write storage strategy supporting legacy CSV, latest snapshots, and date-partitioned history with configurable environment flags
Added severity-based quality gate system with Pandera integration supporting critical/warning/info levels and degradation policies for partial source failures
Introduced DuckDB-based Trend Score engine with equivalence tests validating numeric parity with legacy pandas implementation
Created data product contract system with run/dataset manifests, SemVer versioning, and deterministic schema hashing
Implemented frontend bridge JSON export for historical trend data with feature flag-based partial cutover and CSV fallback
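The severity-based quality gate listed above can be illustrated with a small sketch. The three levels (critical, warning, info) and the `fail` status appear in the PR; the `Issue` type, the function name, and the `"warn"` and `"pass"` status strings are assumptions, and the sketch does not use Pandera.

```python
from dataclasses import dataclass

SEVERITIES = ("critical", "warning", "info")


@dataclass(frozen=True)
class Issue:
    """One finding from a quality check (hypothetical structure)."""
    check: str
    severity: str
    message: str


def gate_status(issues):
    """Collapse a list of issues into a single gate status.

    Any critical issue fails the gate, warnings downgrade it,
    and info-level issues never affect the result.
    """
    found = {issue.severity for issue in issues}
    unknown = found - set(SEVERITIES)
    if unknown:
        raise ValueError(f"unknown severities: {sorted(unknown)}")
    if "critical" in found:
        return "fail"
    if "warning" in found:
        return "warn"
    return "pass"
```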

Reviewed changes

Copilot reviewed 44 out of 45 changed files in this pull request and generated 6 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `.github/workflows/etl_semanal.yml` | Refactored to parallel job architecture with artifact validation and conditional publishing |
| `backend/trend_score.py` | Added engine selector supporting legacy pandas and DuckDB implementations |
| `backend/trend_score_duckdb.py` | New DuckDB-based SQL engine for trend score computation |
| `backend/validador.py` | Enhanced with Pandera quality checks and severity-based issue routing |
| `backend/validate_csv_contract.py` | Updated with Pandera integration and configurable validation modes |
| `backend/quality/pandera_schemas.py` | New module defining dataset schemas and multi-severity quality rules |
| `backend/quality/degradation_policy.py` | New module implementing source availability degradation matrix |
| `backend/config/data_product_contract.py` | New contract defining run and dataset manifest structures with validation |
| `backend/config/schema_contract_utils.py` | New utilities for deterministic schema hashing and SemVer bump recommendations |
| `backend/sync_assets.py` | Enhanced with latest/legacy prioritization and bridge JSON export integration |
| `backend/export_history_json.py` | New module generating frontend bridge JSON from history snapshots |
| `backend/base_etl.py` | Updated with dual-write support for legacy/latest/history destinations |
| `backend/config/settings.py` | Added write strategy flags and path resolution utilities |
| `frontend/lib/services/csv_service.dart` | Enhanced with bridge JSON loading and automatic CSV fallback |
| `frontend/lib/config/feature_flags.dart` | New feature flag system for controlled bridge JSON cutover |
| `frontend/lib/screens/home_screen.dart` | Added temporal trend view card demonstrating bridge integration |
| `tests/*` | Comprehensive test coverage for new modules with 133 passing tests |
| `docs/*` | Updated architecture, contracts, and implementation roadmap documentation |
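For readers unfamiliar with deterministic schema hashing, one common approach is to hash a canonical serialization of the column definitions. The sketch below assumes a schema expressed as a column-name to dtype mapping; the function name and input shape are hypothetical and may not match `backend/config/schema_contract_utils.py`.

```python
import hashlib
import json


def schema_hash(columns):
    """Return a SHA-256 hex digest for a {column_name: dtype} mapping.

    Sorting the items and using fixed JSON separators makes the digest
    independent of dict insertion order and whitespace.
    """
    canonical = json.dumps(
        sorted(columns.items()),
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The result is a 64-character hexadecimal string, which is the shape the `schema_hash` check in the dataset manifest validator expects.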


Copilot · 2026-02-23T02:24:24Z

@@ -1,47 +1,61 @@
-# Política mínima de dependencias y seguridad
+# Politica de Dependencias y Seguridad


The BOM (Byte Order Mark) character \ufeff is present at the beginning of several documentation files. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM from these files for better compatibility.

Suggested change

-\ufeff# Politica de Dependencias y Seguridad
+# Politica de Dependencias y Seguridad
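Removing the BOM can be scripted instead of edited by hand. The following is a minimal sketch using only the standard library; the function name `strip_utf8_bom` is illustrative and not part of the repository.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM in place; return True if the file changed."""
    path = Path(path)
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Rewrite the file without the three BOM bytes (EF BB BF).
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Working on raw bytes avoids re-encoding the rest of the file, so only the BOM itself is touched.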

Copilot · 2026-02-23T02:24:24Z

@@ -1,85 +1,113 @@
-# Contrato de datos CSV (Backend ↔ Frontend)
+# Contrato de Datos (Backend <-> Frontend)


The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change

-\ufeff# Contrato de Datos (Backend <-> Frontend)
+# Contrato de Datos (Backend <-> Frontend)

Copilot · 2026-02-23T02:24:25Z

+        return ["dataset manifest debe ser un objeto (dict/mapping)"]
+
+    for field in DATASET_REQUIRED_FIELDS:
+        if field not in dataset_manifest:
+            errors.append(f"falta campo requerido '{field}'")
+
+    dataset_name = dataset_manifest.get("dataset_logical_name")
+    if "dataset_logical_name" in dataset_manifest and not _is_non_empty_string(dataset_name):
+        errors.append("'dataset_logical_name' debe ser string no vacio")
+
+    version_semver = dataset_manifest.get("version_semver")
+    if "version_semver" in dataset_manifest and not is_valid_semver(version_semver):
+        errors.append("'version_semver' no cumple SemVer")
+
+    generated_at_utc = dataset_manifest.get("generated_at_utc")
+    if "generated_at_utc" in dataset_manifest and not is_valid_iso_utc(generated_at_utc):
+        errors.append("'generated_at_utc' no es ISO-8601 valido con zona horaria")
+
+    source_run_id = dataset_manifest.get("source_run_id")
+    if "source_run_id" in dataset_manifest and not _is_non_empty_string(source_run_id):
+        errors.append("'source_run_id' debe ser string no vacio")
+    if expected_run_id and source_run_id != expected_run_id:
+        errors.append("'source_run_id' no coincide con run_id del manifest principal")
+
+    schema_hash = dataset_manifest.get("schema_hash")
+    if "schema_hash" in dataset_manifest:
+        if not _is_non_empty_string(schema_hash) or _HEX64_RE.fullmatch(schema_hash.strip()) is None:
+            errors.append("'schema_hash' debe ser hash sha256 en hexadecimal (64 chars)")
+
+    row_count = dataset_manifest.get("row_count")
+    if "row_count" in dataset_manifest:
+        if not isinstance(row_count, int):
+            errors.append("'row_count' debe ser integer")
+        elif row_count < 0:
+            errors.append("'row_count' no puede ser negativo")
+
+    quality_status = dataset_manifest.get("quality_status")
+    if "quality_status" in dataset_manifest and quality_status not in DATASET_QUALITY_STATUSES:
+        errors.append(f"'quality_status' invalido: {quality_status}")
+
+    latest_path = dataset_manifest.get("latest_path")
+    if "latest_path" in dataset_manifest and not _is_non_empty_string(latest_path):
+        errors.append("'latest_path' debe ser string no vacio")
+
+    history_path = dataset_manifest.get("history_path")
+    if "history_path" in dataset_manifest:
+        if quality_status == "fail":
+            if history_path is not None and not _is_non_empty_string(history_path):
+                errors.append("'history_path' debe ser null o string no vacio cuando quality_status=fail")
+        elif not _is_non_empty_string(history_path):
+            errors.append("'history_path' debe ser string no vacio")
+
+    return errors
+
+
+def validate_run_manifest(run_manifest: Mapping[str, Any]) -> tuple[bool, list[str]]:
+    """Validates minimal structure and rules for a run manifest."""
+    errors: list[str] = []
+
+    if not isinstance(run_manifest, Mapping):
+        return False, ["run manifest debe ser un objeto (dict/mapping)"]
+
+    for field in RUN_REQUIRED_FIELDS:
+        if field not in run_manifest:
+            errors.append(f"falta campo requerido '{field}'")
+
+    run_id = run_manifest.get("run_id")
+    if "run_id" in run_manifest and not _is_non_empty_string(run_id):
+        errors.append("'run_id' debe ser string no vacio")
+
+    generated_at_utc = run_manifest.get("generated_at_utc")
+    if "generated_at_utc" in run_manifest and not is_valid_iso_utc(generated_at_utc):
+        errors.append("'generated_at_utc' no es ISO-8601 valido con zona horaria")
+
+    for field in ("source_window_start_utc", "source_window_end_utc"):
+        value = run_manifest.get(field)
+        if field in run_manifest and not is_valid_iso_utc(value):
+            errors.append(f"'{field}' no es ISO-8601 valido con zona horaria")
+
+    quality_gate_status = run_manifest.get("quality_gate_status")
+    if "quality_gate_status" in run_manifest and quality_gate_status not in QUALITY_GATE_STATUSES:
+        errors.append(f"'quality_gate_status' invalido: {quality_gate_status}")
+
+    for field in ("git_sha", "branch"):
+        value = run_manifest.get(field)
+        if field in run_manifest and not _is_non_empty_string(value):
+            errors.append(f"'{field}' debe ser string no vacio")
+
+    datasets = run_manifest.get("datasets")
+    if "datasets" in run_manifest:
+        if not isinstance(datasets, list):
+            errors.append("'datasets' debe ser lista")
+        elif not datasets:
+            errors.append("'datasets' no puede estar vacio")
+        else:
+            for index, dataset_manifest in enumerate(datasets):
+                dataset_errors = validate_dataset_manifest(
+                    dataset_manifest,
+                    expected_run_id=run_id if _is_non_empty_string(run_id) else None,
+                )
+                errors.extend(f"datasets[{index}]: {message}" for message in dataset_errors)


Multiple error messages in this file are in Spanish (e.g., 'dataset manifest debe ser un objeto', 'falta campo requerido', 'debe ser string no vacio', etc.). According to the coding style guide at docs/coding_style.md, backend modules should use English for comments and docstrings. Error messages should also follow this convention for consistency across the codebase. Consider translating these error messages to English.

Copilot · 2026-02-23T02:24:25Z

@@ -0,0 +1,62 @@
+# Estandar de Estilo del Repositorio


The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change

-\ufeff# Estandar de Estilo del Repositorio
+# Estandar de Estilo del Repositorio

Copilot · 2026-02-23T02:24:25Z

@@ -1,109 +1,96 @@
-# Architecture -- Technology Trend Analysis Platform
+# Arquitectura del Proyecto


The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change

-\ufeff# Arquitectura del Proyecto
+# Arquitectura del Proyecto

Copilot · 2026-02-23T02:24:25Z

-│            Flutter Web Dashboard                     │
-│  4 views · fl_chart · Export ZIP · Responsive        │
-└─────────────────────────────────────────────────────┘
+# Technology Trend Analysis Platform


The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.

Suggested change

-\ufeff# Technology Trend Analysis Platform
+# Technology Trend Analysis Platform

Sam-24-dev added 3 commits February 22, 2026 02:51

refactor backend data contracts and output pipeline structure

eca3247

Refactor data pipeline contracts, quality gates, and CI reliability

158ace3

Adjust dependency audit exception for unresolved NLTK CVE

505bd82

Copilot AI review requested due to automatic review settings February 23, 2026 02:16

Copilot started reviewing on behalf of Sam-24-dev February 23, 2026 02:17

Sam-24-dev merged commit c9ea6c5 into main Feb 23, 2026
6 checks passed

Copilot AI reviewed Feb 23, 2026


Sam-24-dev deleted the feat/backend branch March 16, 2026 01:02
